The IIT Bombay English-Hindi Parallel Corpus
Abstract
The IIT Bombay English-Hindi corpus contains a parallel corpus for English-Hindi compiled from a variety of existing sources, as well as corpora developed at the Center for Indian Language Technology (CFILT), IIT Bombay over the years. The training corpus consists of sentences, phrases, and dictionary entries, spanning many applications and domains. The details of the training corpus are shown in Table 1. The sub-corpora in the download archive are in the same order as listed in the table, so they can be extracted separately if required.

We briefly describe some sub-corpora that have not been described in previous literature. Judicial domain corpus I consists of translations of legal judgements by expert translators who do not, however, have a legal background. Judicial domain corpus II contains translations done by students taking a graduate course on natural language processing, as part of a course project. Mahashabdkosh is an online official terminology dictionary website hosted by the Department of Official Language, India. It contains Hindi as well as English terms along with definitions and usage examples, which are translations of each other. The Indian Government corpora have been manually collected by CFILT from various websites related to the Indian government, such as the National Portal of India, Reserve Bank of India, Ministry of Human Resource Development, NABARD, etc.

The test and dev corpora are newswire sentences, the same ones used in the WMT 2014 English-Hindi shared task (Bojar et al., 2014a). The training, dev, and test corpora consist of 1,492,827, 520, and 2,507 segments respectively. The corpora can be downloaded from http://www.cfilt.iitb.ac.in/iitb_parallel. We recommend the following monolingual corpora for training language models: the corpora compiled by Bojar et al. (2014b) for Hindi, and the corpora provided by the WMT shared tasks for English.
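Because the sub-corpora are stored in the order in which the table lists them, one can be recovered by slicing the parallel files at cumulative segment offsets. The following is a minimal sketch of that idea, not the authors' tooling. It assumes the archive provides two line-aligned plain-text files (one segment per line); the function names and any file names or segment counts passed to them are hypothetical, and the real per-sub-corpus counts must be taken from Table 1.

```python
from itertools import islice
from typing import Dict, Iterator, List, Tuple


def subcorpus_ranges(sizes: List[Tuple[str, int]]) -> Dict[str, Tuple[int, int]]:
    """Map each sub-corpus name to its half-open [start, end) segment range.

    `sizes` lists (name, segment_count) pairs in archive order. The counts
    are supplied by the caller (e.g. copied from Table 1), not known here.
    """
    ranges: Dict[str, Tuple[int, int]] = {}
    start = 0
    for name, count in sizes:
        ranges[name] = (start, start + count)
        start += count
    return ranges


def read_subcorpus(
    en_path: str, hi_path: str, start: int, end: int
) -> Iterator[Tuple[str, str]]:
    """Yield (english, hindi) segment pairs for lines start..end-1.

    Assumes both files are UTF-8 and line-aligned (an assumption about
    the archive layout, not something stated in the source).
    """
    with open(en_path, encoding="utf-8") as en, open(hi_path, encoding="utf-8") as hi:
        for en_line, hi_line in islice(zip(en, hi), start, end):
            yield en_line.rstrip("\n"), hi_line.rstrip("\n")


# Hypothetical usage; names, counts, and paths are placeholders:
# ranges = subcorpus_ranges([("first_subcorpus", 1000), ("second_subcorpus", 500)])
# start, end = ranges["second_subcorpus"]
# pairs = list(read_subcorpus("train.en", "train.hi", start, end))
```

Reading both files in lockstep keeps memory use flat, which matters for a training set of roughly 1.5 million segments.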
Journal: CoRR
Volume: abs/1710.02855
Publication date: 2017